feat: external IVF-PQ vector index over parquet (Phase 1) — RFC draft#1
Draft
sezruby wants to merge 23 commits into
Draft
feat: external IVF-PQ vector index over parquet (Phase 1) — RFC draft#1sezruby wants to merge 23 commits into
sezruby wants to merge 23 commits into
Conversation
…rmat#6848) ## Summary - allow MemWAL initialization on append-only tables without unenforced primary key metadata - keep bucket sharding constrained to the primary key when a primary key exists, but allow no-PK tables to bucket by a non-nested column - update Rust, Python, and Java tests plus MemWAL docs Fixes lance-format#6846 ## Tests - cargo fmt --all - cargo test -p lance test_initialize_mem_wal_bucket_sharding - uv run make build - uv run pytest python/tests/test_mem_wal.py::test_initialize_mem_wal_bucket_sharding_without_primary_key - git diff --check
…-format#6752) Carries on lance-format#5997 (and the benchmarking in discussion lance-format#5947), and follows up on lance-format#6728 where moving S3 Express away from O(n) manifest listing to a version hint was raised — picking that up here. ## What On object stores where `list` is **not** lexicographically ordered (e.g. S3 Express, the local filesystem), resolving the latest manifest version is O(n) in the number of versions. To avoid this, after every successful commit on such a store we write a small JSON file `_versions/latest_version_hint.json` with content `{"version":N}`. A reader then does a GET on the hint file plus a few HEAD probes (O(k), where k = versions added since the hint was written), and falls back to a full listing if the hint is missing (older datasets) or stale. - The hint is written/read **only on non-lexically-ordered stores**. On S3 Standard / GCS / Azure / OSS / Tencent / DynamoDB / memory the ordered listing already resolves the latest version in roughly one request, so the hint would only add a PUT per commit for nothing. - `current_manifest_path` uses the hint for non-lexically-ordered, non-local stores (the local filesystem keeps its existing single-directory-read fast path); `CommitHandler::list_manifest_locations_since` (used by `load_new_transactions`) follows the same strategy. - The hint write is **awaited** as part of the commit (no fire-and-forget mode). It is best-effort: failures are logged and ignored, since the hint only accelerates reads and never affects correctness — readers always verify the hinted version and probe upward from it. Detached versions are never written to the hint. - A transient (non-`NotFound`) object-store error while probing abandons the hint path so the caller falls back to a full listing rather than trust a possibly-stale or incomplete result. The gap-fill HEADs are bounded by `io_parallelism()`, and a far-behind reader (gap > 1000) falls back to a single paginated listing. ## Differences from lance-format#5997 - Only the JSON hint format is kept (the alternative file-size-encoded format and its env var are dropped). - The fire-and-forget / async hint-write mode is removed — the hint is always written synchronously, which keeps concurrent writes simpler with no meaningful latency cost. - The hint is gated to non-lexically-ordered stores, where it's actually read. - `current_manifest_path` picks one strategy based on the store rather than racing a HEAD-probe against a listing, keeping IO behavior deterministic. A `manifest_commit` benchmark is included to measure commit/load latency growth with many small fragments. Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
…ter (lance-format#6818) ## Problem `test_ann_prefilter` is flaky and failed on CI (`linux-arm`, Rust) — e.g. on the unrelated PR lance-format#6757 — with the HNSW+SQ parametrization returning a near-miss neighbor (row 10 instead of 6). ## Root cause HNSW node-level assignment uses an **unseeded** thread RNG (`rand::rng()`) in both the offline (`HnswBuilder`) and online (`OnlineHnswBuilder`) builders, so every index build produces a different random graph. On a tiny 300-vector dataset, an approximate HNSW+SQ search over a different graph each run can return a near neighbor instead of the exact one. `main` was green by luck of the RNG, not correctness. This is **not** caused by lance-format#6757 (the `String`→`Uuid` index-id refactor): index cache keys and on-disk index paths are byte-identical before/after that change; the test only surfaced the pre-existing flakiness. ## Fix - Seed both level-assignment sites with a shared fixed constant (`HNSW_LEVEL_RNG_SEED`) via `SmallRng`, making graph construction reproducible. Recall is statistically unaffected (identical level distribution; only the draws are fixed). A constant — rather than a new `HnswBuildParams` field — keeps the change contained (no serde/proto/binding changes). - Harden `test_ann_prefilter` to assert the property it actually validates (prefilter honored: `filterable > 5`) instead of an exact nearest-neighbor id, per the repo guideline that vector-index tests assert recall, not exact matches. Co-authored-by: Vova Kolmakov <wombatukun@apache.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… rows (lance-format#6774) ## Summary * Fixes lance-format#6735. * `transaction.rs` — `resolve_update_version_metadata`: in the per-row `created_at` mapping, check `row_id_to_source.contains_key(&rid)` before calling `resolve_created_at_version`. Rows not in the map are INSERT branch rows (no source in existing fragments); they now receive `new_version` as `created_at` instead of the previous fallback of `UNKNOWN_CREATED_AT_VERSION` (1). * `resolve_created_at_version` doc comment updated to clarify it is only called for UPDATE branch rows (source confirmed present). The unmapped-row-ID branch inside the function is unused when called from `resolve_update_version_metadata`; `UNKNOWN` (1) still applies for UPDATE rows whose source fragment has missing or bad `created_at_version_meta` (cache miss, decode failure, or out-of-range offset). * Two existing tests updated to assert the corrected behavior; one new test added. ## Background MERGE INTO commits through `Operation::Update` and produces both UPDATE branch rows (rewritten into `new_fragments` with a source row in the previous manifest, stable row ID present in `row_id_to_source`) and INSERT branch rows (new rows also in `new_fragments`, stable row ID assigned fresh, not present in any existing fragment). Before this change, `resolve_update_version_metadata` built the per-row `created_at_versions` vector by calling `resolve_created_at_version` for every row ID. For UPDATE branch rows that function correctly copies `created_at` from the source fragment. For INSERT branch rows the map lookup fails and the function returns `UNKNOWN_CREATED_AT_VERSION = 1`, producing a wrong historical version for every newly inserted row. CDF consumers cannot distinguish merge-inserted rows from updated rows via `_row_created_at_version`, and the value 1 is meaningless for rows that first appeared in a recent commit. The fix is a single guard at the call site: only call `resolve_created_at_version` for rows confirmed to have a source (UPDATE branch); for all other rows use `new_version` directly. ## Implementation notes * The guard uses `row_id_to_source.contains_key(&rid)`, which is an O(1) hash lookup on the same map already built for the UPDATE branch path — no additional data structures or iteration. * No lance-spark changes are needed. The Spark commit path (SparkPositionDeltaWrite) already attaches `RowIdMeta` to new fragment rows. This change activates the correct behavior automatically for all callers of `Operation::Update`, including lance-spark MERGE INTO. * lane-spark test update lance-format/lance-spark#530 ## Test plan * `test_update_version_tracking_insert_branch_gets_new_version` (renamed from `test_update_version_tracking_unknown_row_id_defaults_to_1`): new fragment with one UPDATE branch row (ID 10, source `created_at = 5`) and one INSERT branch row (ID 999); asserts `created_at = [5, 5]` — UPDATE branch copies from source, INSERT branch gets `new_version` (5). * `test_update_version_tracking_merge_into_distinguishes_insert_and_update_branch` (new): new fragment interleaves UPDATE branch rows (IDs 10, 11, source `created_at = 3`) and INSERT branch rows (IDs 500, 501); asserts `created_at = [3, 5, 3, 5]` to verify per-row correctness across both branches in the same fragment. * `test_update_version_tracking_no_row_id_meta_fallback`: assertion updated from `[1, 1, 1]` to `[5, 5, 5]` — a fragment with no `row_id_meta` gets fresh stable IDs assigned by `assign_row_ids`; those IDs have no source and are INSERT branch rows, so `created_at` equals `new_version`. * `test_update_version_tracking_source_fragment_no_created_at_defaults_to_1` (unchanged): confirms that UPDATE branch rows whose source fragment has no `created_at_version_meta` still fall back to `UNKNOWN` (1) — the remaining reachable path through `resolve_created_at_version`. Co-authored-by: Jing chen He <jingh@adobe.com>
## Problem Legacy branches, i.e. branches whose `BranchContents` were written without a persisted `branch_identifier`, currently deserialize through `BranchIdentifier::none()`. That fallback generates a fresh random UUID on each read, so the same unchanged branch can surface a different `branch_identifier` across repeated loads. This makes branch identity unstable in both Python and Java for legacy datasets. On the Python side, `branches.list()` / `branches_ordered()` expose `branch_identifier` directly, so callers that diff, cache, or snapshot branch metadata can observe false changes even when the branch itself has not changed. On the Java side, the same legacy branch can also appear with a different identifier across refreshes, which makes equality-style comparisons unstable as well. ## Summary - stabilize fallback branch identifiers for legacy branch metadata by replacing the missing-identifier sentinel with a deterministic synthetic UUID during branch metadata reads - keep the fallback logic localized to Rust branch metadata loading so Python and Java continues returning stable `branch_identifier` values without API shape changes - add a lightweight Rust regression test that exercises `BranchContents::from_path` on in-memory branch metadata and verifies stable repeated reads plus distinct identifiers for different branch names
Document the PR publishing requirements in `AGENTS.md` so agent-created PRs match the checks enforced by CI before they are opened or updated. This records two required gates: - PR titles must follow Conventional Commits because `.github/workflows/pr-title.yml` validates the title and body with commitlint. - PRs must run lint checks for every touched language surface before creation or update. Rust changes require `cargo fmt --all` and `cargo clippy --all --tests --benches -- -D warnings`; Python changes require the `python/AGENTS.md` environment workflow and `uv run make lint` from `python/`. If a required lint check cannot be run, the blocker must be stated explicitly in the PR summary.
…mat#6793) Makes BTree scalar index cache entries serializable, so a persistent cache backend can store and reload them without re-reading from storage. Previously the whole BTree index was cached as `Arc<dyn ScalarIndex>` under an `UnsizedCacheKey`, which can never carry a codec, and each `FlatIndex` page was cached in-memory only. Changes: - `CacheCodecImpl for FlatIndex` (one BTree page) and `BTreePageKey::codec()`. - Top-level scalar index caching becomes a plugin implementation detail via the existing `ScalarIndexPlugin::get_from_cache`/`put_in_cache` hooks. The default impl preserves today's in-memory unsized caching (backwards compatible); the BTree plugin overrides it with a sized, codec-backed `BTreeIndexState` (the lookup `RecordBatch` + `batch_size` + `ranges_to_files`, from which `try_from_serialized` rebuilds the index with no IO). - Caching moves into `scalar::open_scalar_index` (get → miss → load → put); the dataset-level `ScalarIndexCacheKey` logic is removed from `Dataset::open_scalar_index`. This keeps index-type-specific knowledge in `lance-index` rather than leaking a state trait + dispatch into `lance/src/index.rs`. Adds an integration test asserting that after prewarming with a serializing cache backend, an indexed-filter query does 0 read IOPS. Bitmap index will follow the same pattern in a separate PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cache vector index configuration within the index metadata, such as the distance type and build parameters. Previously, to determine things like the distance type or index type of a vector index, the index file itself had to be opened. This PR stores that information in `VectorIndexDetails` within the manifest's `index_details` field, which is fetched and cached eagerly when loading the manifest. Old indexes have this field left blank. When blank, the details are extracted from the index files and cached. This migration happens on the first write with a new library version. ## What's stored in VectorIndexDetails **Core build parameters** (typed fields — required for any runtime to build the index): - `metric_type` - `target_partition_size` (IVF) - `hnsw_index_config` — `max_connections`, `construction_ef`, `max_level` (HNSW) - `compression` — PQ/SQ/RQ/flat, including `num_bits`, `num_sub_vectors`, `rotation_type` **Runtime hints** (`map<string, string> runtime_hints`): Optional build preferences that don't affect index structure. Stored so a background rebuild process can reproduce the original configuration. Runtimes that don't recognize a key must silently ignore it. Only non-default values are written. Keys use reverse-DNS namespacing: `lance.*` for core Lance hints, other prefixes for runtime-specific hints (e.g., `lancedb.accelerator` for GPU acceleration in LanceDB Enterprise). Current `lance.*` hints: `lance.ivf.max_iters`, `lance.ivf.sample_rate`, `lance.ivf.shuffle_partition_batches`, `lance.ivf.shuffle_partition_concurrency`, `lance.pq.max_iters`, `lance.pq.sample_rate`, `lance.pq.kmeans_redos`, `lance.sq.sample_rate`, `lance.hnsw.prefetch_distance`, `lance.skip_transpose`. Also adds `apply_runtime_hints()` to read hints back into build params for future rebuild logic. Closes lance-format#5963 --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…es (lance-format#6874) Adds `CacheCodec` impls so Bitmap and LabelList index cache entries survive through a persistent cache backend, mirroring the BTree work in lance-format#6793. - `CacheCodecImpl for RowAddrTreeMap` (delegates to existing `serialize_into`/`deserialize_from`), so per-value bitmap entries cached under `BitmapKey` are codec-backed. - `BitmapIndexState` captures the value→offset map (Arrow IPC), the null bitmap, and the value type. `BitmapIndexPlugin` overrides `get_from_cache`/`put_in_cache` to store this sized state. - `LabelListIndexState` wraps an inner `BitmapIndexState` plus `list_nulls` and gets the same plugin-level codec treatment. - `open_scalar_index` skips the LabelList compatibility check on cache hits, so a fully-cached LabelList query no longer pays an extra `bitmap_page_lookup.lance` open per call. ## Tests - Unit codec round-trip for `BitmapIndexState` (empty + populated). - Integration tests `test_{bitmap,label_list}_prewarm_with_serializing_backend_serves_query_with_no_io` asserting zero IOPS after prewarm through a serializing cache backend. Closes lance-format#6744
…mat#6816) ## Problem In the LSM scanner, every query against an L0 (frozen/flushed) generation re-opens that generation's Lance dataset from object storage. There are **three** identical cold-open sites — scan (`planner.rs`), point lookup (`point_lookup.rs`), and vector search (`vector_search.rs`) — each doing `DatasetBuilder::from_uri(path).load()` with no session. Per query, per flushed generation, this pays: manifest version discovery + manifest read + decode, file-metadata decode, and scalar/vector index load. For an LSM tree, frozen generations are the single best caching target, yet they were the only data source paying full cold-open cost on every query. ## Key invariant Flush writes each generation **once** to a globally-unique, content-addressed path (`memtable/flush.rs`). Same path ⟹ same bytes, forever — a cache hit can never be stale. This is the rare cache that needs **no TTL and no invalidation for correctness**; pruning is desirable only to reclaim memory. ## Changes (OSS `lance`) Two complementary, independently-useful, opt-in pieces — defaults preserve existing behavior exactly: 1. **`with_session` plumbing** — thread an existing `Arc<Session>` into the scanner/planners so the first open of each generation populates and reuses the shared index + file-metadata caches. `LsmScanner::new` defaults this to the base table's session; `without_base_table` defaults to `None`. 2. **`FlushedDatasetCache`** — a `moka`-backed, single-flight cache of `Arc<Dataset>` keyed by resolved flushed path, owned and sized by the consumer and injected per-request. After the first open, every subsequent query for that generation is a pure `Arc::clone` with zero object-store I/O. `retain_paths(live_paths)` prunes retired generations at compaction (memory-only; correctness never depends on it). A single shared `open_flushed_dataset(path, session, cache)` helper replaces all three cold-open sites (repo rule: dedupe logic in 2+ places). `None`/`None` reproduces the original behavior precisely, so no existing test changes. `data_source.rs` / `collector.rs` are untouched — opening stays lazy inside the planner, preserving bloom-filter pruning on point lookups. Planner wiring uses chainable `with_session`/`with_flushed_cache` builder methods rather than constructor changes, keeping `new()` signatures (and every existing test/bench) untouched. ## Testing - New unit tests for `FlushedDatasetCache`: miss opens once; hit returns the same `Arc` (pointer eq); 16-way concurrent `get_or_open` opens exactly once (single-flight); `retain_paths` drops the right keys; no-cache path cold-opens each call. - Regression: full `mem_wal::scanner` suite (78 tests) passes untouched. - `cargo clippy -p lance --tests --benches` clean; `cargo fmt` clean. ## Notes The `sophon` consumer side (process-bootstrap cache ownership, scanner wiring, compaction `retain_paths`) is out of scope for this PR. Phase 1 (`with_session`) is independently shippable ahead of the cache. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an Arrow-native MemWAL sharding evaluator and exposes it through the Java API/JNI. - Evaluates MemWAL sharding specs against Arrow RecordBatch values for bucket, identity, and unsharded fields. - Resolves sharding source IDs through a Java-provided source-id-to-column map. - Adds Java-facing ShardingEvaluator returning an Arrow reader for the evaluated sharding key batch. This is needed by lance-spark to route writes using Lance's sharding semantics instead of duplicating Spark-side bucket logic.
## Summary - Add granular tracing targets for Lance event categories under `lance::events::...`. - Preserve the original tracing target in Python log passthrough so `LANCE_LOG` can filter individual event types directly. - Document the new event targets and add Python regression coverage for target-specific log filtering. ## Testing - `PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH cargo fmt --all -- --check` - `git diff --check` - `PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH cargo check -p lance-core -p lance-io -p lance-index -p lance -p lance-datafusion` - `PATH=/Users/beinan/.rustup/toolchains/1.94.0-aarch64-apple-darwin/bin:$PATH cargo check --manifest-path python/Cargo.toml` - `uv sync --python 3.12 --no-install-project` - `.venv/bin/maturin develop --uv -v` - `uv run --no-sync --python 3.12 pytest -q python/tests/test_log.py::test_lance_log_filters_trace_event_targets python/tests/test_tracing.py::test_tracing_callback` - `uv run --no-sync --python 3.12 ruff format --check python/tests/test_log.py python/tests/test_tracing.py` - `uv run --no-sync --python 3.12 ruff check python/tests/test_log.py python/tests/test_tracing.py` --------- Co-authored-by: Beinan Wang <beinanwang@microsoft.com>
Split from lance-format#6856 — vector-search portion. A primary key written multiple times into one memtable, or into both a memtable and an older generation, used to leak through as distinct rows: HNSW indexes every insert as its own graph node, so KNN could return both V1 and V2 of the same PK from a single source. Each per-source KNN now runs through `LsmSourceTagExec`, which appends `(_memtable_gen, _freshness)`. A single `LsmGlobalPkDedupExec` over the union keeps the row with the largest tuple per PK — newer generations win, ties fall to the normalized within-source order. This replaces the bloom-based `FilterStaleExec` design and is exact (no false-positive recall loss, no top-k under-fill). After global dedup + sort + top-k, a `TakeExec` materializes any user-projected columns not in the per-source KNN output by fetching from the base dataset via `_rowid`. `plan_search()` also accepts `refine_factor` so callers can enable base-table refine. Exposed in the Python and Java bindings. Removes `FilterStaleExec` and `GenerationBloomFilter`. Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995. Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…e-format#6880) Split from lance-format#6856 — point-lookup portion. A primary key written multiple times into one active memtable used to leak through to the user as distinct rows: `FilterExec + LIMIT 1` over an insert-ordered scan returned the *oldest* match among duplicates. The active arm now runs `WithinSourceDedupExec(KeepMaxFreshness)`, which collapses by PK and keeps the freshest row. Flushed and base arms still rely on `LIMIT 1` under the reverse-write / forward-write conventions. Part of splitting lance-format#6856 into focused PRs. Co-authored with @jackye1995. Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
…ance-format#6799) ## Summary Fixes lance-format#5155. `InstrumentedRecordBatchStreamAdapter` measures a node's `elapsed_compute` by timing its outer `poll_next`. For an `ExecutionPlan` node with child inputs, that poll transitively polls every child — so `EXPLAIN ANALYZE` shows the parent's CPU as `parent + child + grandchild + ...`, and each ancestor double-counts its descendants. This PR fixes five nodes that have child inputs and were using the broken wrapper. The fix takes two shapes depending on the node: 1. **Four nodes** with a clean "child input -> per-batch transform -> output" shape are converted to use a new helper, `InstrumentedChildInputStream<F, Fut>` (in `rust/lance/src/io/exec/utils.rs`), modeled on DataFusion's `FilterExecStream`. The helper pulls from a child input **without** the timer running, then drives a per-batch async transform **with** the timer running. 2. **`FlatMatchQueryExec`** doesn't fit that shape — its work all happens inside `flat_bm25_search_stream`, which consumes the child input and `spawn_cpu`s the per-batch tokenize/count work internally. For this node, the fix instruments `flat_bm25_search_stream` directly so it can report CPU time on a metric handle supplied by the caller. ## Nodes fixed | Node | File | Mechanism | |---|---|---| | `AddRowAddrExec` | `rust/lance/src/io/exec/rowids.rs` | helper | | `MapIndexExec` | `rust/lance/src/io/exec/scalar_index.rs` | helper | | `FlatMatchFilterExec` | `rust/lance/src/io/exec/fts.rs` | helper | | `KNNVectorDistanceExec` | `rust/lance/src/io/exec/knn.rs` | helper + caller-side `Instant::now()` around `compute_distance(...).await` to capture `spawn_blocking` work | | `FlatMatchQueryExec` | `rust/lance/src/io/exec/fts.rs` + `rust/lance-index/src/scalar/inverted/index.rs` | new `flat_bm25_search_stream_with_metrics` that records CPU on a supplied `Time` handle | For `KNNVectorDistanceExec`, this also addresses the spawned-CPU undercount from lance-format#5155: the helper's timer doesn't measure work happening on `spawn_blocking` worker threads. KNN's transform closure wraps `compute_distance(...).await` with `Instant::now()` and adds the elapsed duration to `elapsed_compute` from the caller side. `compute_distance`'s public signature is unchanged. For `FlatMatchQueryExec`, the analogous mechanism lives in `lance-index`: `flat_bm25_search_stream_with_metrics` accepts an `Option<Time>` and records CPU around both the `spawn_cpu` body in `tokenize_and_count` (phase 1) and the synchronous `initialize_scorer` + `flat_bm25_score` (phase 2). The original `flat_bm25_search_stream` is preserved as a `#[deprecated]` thin wrapper that delegates with `None` — no breaking API change. ## Test plan ### Unit tests - **`utils.rs::instrumented_child_input_stream_excludes_child_poll_time`**: a child stream sleeps 60ms per `poll_next`; the transform sleeps 30ms. The helper's `elapsed_compute` should be `~3 × 30ms`, not `~3 × 90ms`. Verified to **fail** on the pre-fix implementation. - **`utils.rs::instrumented_child_input_stream_propagates_child_error`**: an error mid-stream from the child propagates through the helper without losing preceding OK batches. - **`fts.rs::test_flat_match_filter_find_matches_large_utf8`**: exercises the `i64`-offset specialization of `find_matches` (production path for long text columns). - **`lance-index/scalar/inverted/index.rs::flat_bm25_search_stream_with_metrics_records_elapsed_compute`**: runs `_with_metrics` with `Some(time)`, asserts `time.value() > 0` after drain. ### Existing tests ``` cargo test -p lance --lib io::exec::utils io::exec::knn io::exec::rowids io::exec::scalar_index io::exec::fts cargo test -p lance-index --lib scalar::inverted::index ``` All pass. `cargo fmt --all --check` and `cargo clippy -p lance --no-deps` are clean for changed files. ### End-to-end EXPLAIN ANALYZE before/after 1M-row synthetic dataset with BTREE + INVERTED + IVF_PQ indexes, comparing `elapsed_compute` between this branch's parent (pre-fix) and the branch tip: | Scenario | Node | Before | After | Change | |---|---|---|---|---| | KNN + BTREE scalar prefilter | `KNNVectorDistance` (493 rows out) | 623µs | 32µs | **~20× smaller** | | KNN + FTS post-filter | `FlatMatchFilter` (0 rows out) | 102µs | 38µs | **~2.7× smaller** | | KNN + FTS prefilter | `FlatMatchQuery` (2.45K rows out) | 932µs | 116ms | **was severely under-counted** (spawn_cpu worker CPU was invisible to the poll-based timer the new value reflects real cross-thread CPU and naturally exceeds wall-clock when work is parallel) | | KNN + FTS prefilter (control) | `MatchQuery` | unchanged | unchanged | noise only | The first two rows mirror the issue's described shape: heavy synchronous upstream inflating a downstream node's metric. The third row demonstrates the spawned-CPU half of the fix — `FlatMatchQuery`'s tokenize/score work was nearly invisible pre-fix and is now correctly attributed.
* API and namespace stubs for new create_materialized_view API * New REST route for materialized view refresh --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unreleased version after creating v7.0.0-rc.1
…e-format#6871) The function `mask_to_offset_ranges` is used at scan planning time to determine which rows to read from the file. This was a bottleneck when the mask was the result of a zonal index search because the old implementation materialized all of the offsets only to convert them back into ranges. Luckily, roaring recently implemented a range-based iterator. Using this we can skip the materialization step. On my zonemap benchmark this doubles the speed of the search and, perhaps more importantly, removes a penalty I observed when the index is used even on queries that are not highly selective. Generated with the assistance of Claude code. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Phase 1 of the External Vector Index RFC: build, open, search, and post-topK fetch_rows over a list of parquet files, with no Lance dataset dependency. Full design at sezruby/lance-spark#4. Rust surface (rust/lance/src/index/vector/external/): - ExternalIvfPqIndex: handle with build/open/search/fetch_rows - ParquetFileSpec, SearchResult, ParquetRowKey value types - ExternalIvfPqIndexParams builder - RowFilter trait (Delta DV / Iceberg position-deletes extensibility) - ParquetVectorSource: training sample + rid-annotated batch stream - manifest.json sidecar (parquet file list + build params) - parquet refinement via PageIndexPolicy::Required random access - List<Float32> coercion for JVM parquet writers (Spark emits List, not FixedSizeList, even when all rows have the same length) ivf.rs: new pub fn write_ivf_pq_file_external(&ObjectStore, &Path, ...) parallel to write_ivf_pq_file_from_existing_index but with no &Dataset parameter. Existing dataset-backed path becomes a thin wrapper. JNI surface (java/lance-jni/src/external_index.rs): - nativeBuild, nativeOpen, nativeClose, nativeSearch, nativeFetchRows - RowFilter exposed as a packed LE byte[] of (file_id<<32)|row_index deleted rids (sidesteps cross-language callbacks while still supporting Delta DV / Iceberg position deletes) - fetchRows returns Arrow IPC stream bytes Java surface (java/src/main/java/org/lance/index/external/): - ExternalIvfPqIndex (handle, AutoCloseable, JavaBean accessors) - ExternalIvfPqIndexParams (builder pattern) - SearchResult, ParquetRowKey value classes Tests: - rust/lance/tests/external_index_phase1.rs: end-to-end integration test (build → open → search recall → fetchRows projection in input order with duplicates → RowFilter exclusion). 16 left vectors × 3 parquet files of 320 rows; passing. - 14 unit tests across the new module. All passing. - 5 JNI smoke tests (handle load, exception bridge, packed-rid round trip, params builder). All passing.
This was referenced May 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft PR for tracking — see sezruby/lance-spark#4 for the full RFC + design + benchmark numbers.
Summary
Phase 1 of the External Vector Index RFC. Lets Lance build and query an IVF-PQ vector index directly over a list of caller-supplied parquet files, with no Lance dataset required for either build or query — including full refinement via Lance-owned parquet random access.
The user-facing primitive is "a Lance index file over a list of parquet files" — a self-contained artifact that doesn't depend on a Lance dataset and doesn't push parquet random-access code into the engine layer. Unlocks Lance vector search for Spark over parquet/Delta, DuckDB over parquet, in-memory engines, embedded use cases.
What's in this PR
Rust surface
rust/lance/src/index/vector/external/:mod.rs—ExternalIvfPqIndexhandle:build / open / search / fetch_rowstypes.rs—ParquetFileSpec,SearchResult,ParquetRowKey,RowFiltertraitparams.rs—ExternalIvfPqIndexParamsbuilderparquet_source.rs—ParquetVectorSource(training sample + rid-annotated batch stream); also handlesList<Float32>from JVM parquet writers (Spark's parquet writer doesn't preserveFixedSizeListeven when all rows have the same length)manifest.rs— sidecarmanifest.jsonwith parquet file list + build paramsbuild.rs— orchestrates KMeans → PQ codebook → IvfTransformer → shuffle → writeopen.rs— reads manifest + index file, decodes IVF + PQ from existing protobufsearch.rs— IVF probe + per-file parquet refinement viaPageIndexPolicy::Required; pluggableRowFilterfetch.rs— post-topK random row fetch with arbitrary projection (the killer feature for engine integration)rust/lance/src/index/vector/ivf.rs:pub async fn write_ivf_pq_file_external(&ObjectStore, &Path, ..., dataset_version: u64, ivf, pq, streams)— parallel to the existingwrite_ivf_pq_file_from_existing_indexbut with the&Datasetparameter dropped. Existing dataset-backed path becomes a thin wrapper.rust/lance/Cargo.toml:parquetpromoted from dev-dep to runtime dep.JNI + Java surface
java/lance-jni/src/external_index.rs:nativeBuild,nativeOpen,nativeClose,nativeSearch,nativeFetchRows, plus accessors fornumPartitions,numFiles,vectorColumn.RowFilterexposed as a packed LEbyte[]of(file_id << 32) | row_indexdeleted rids — sidesteps cross-language callbacks while still supporting Delta DV / Iceberg position deletes.fetchRowsreturns Arrow IPC stream bytes (decoded JVM-side viaArrowStreamReader).java/src/main/java/org/lance/index/external/:ExternalIvfPqIndex— handle,AutoCloseable, JavaBean accessors perjava/CLAUDE.mdstyleExternalIvfPqIndexParams— builder pattern withMetricenumSearchResult—(filePath, rowIndex, distance)value classParquetRowKey—(filePath, rowIndex)value classTests
rust/lance/tests/external_index_phase1.rs— end-to-end integration test: build → open → search (recall above threshold) → fetchRows projection in input order with duplicates →RowFilterexclusion. 16 left vectors × 3 parquet files of 320 rows.All passing.
Headline benchmark numbers
Cluster Spark 3.5 (8 × 4 cores × 16 GB executor pods). Configs:
kNearestJoin(the existing fallback path for non-Lance R)ExternalIvfPqIndexPer-query latency at K=10, |L|=100, dim=128, IVF=256, PQ=16:
* Earlier benchmark used un-indexed Lance-native R as "C". Re-running with indexed C this round; numbers will be folded into the issue.
E-warm is 30-33× faster than B-narrow at scale. E pays a fixed index build (~30s on cluster for 1M × dim=128 IVF=256 PQ=16) that amortizes after ~2 queries vs B-narrow.
The full benchmark methodology + cross-compile workflow (cargo zigbuild for linux x86-64 from macOS arm64) lives in sezruby/lance-spark#4.
What's NOT in this PR
append()/compact()for incremental index updates (RFC Phase 4)(file_path, num_rows). Cross-Spark-session persistent index reuse needs this; in-process driver-side cache is sufficient for the current lance-spark integration.Status
Marked DRAFT — this is for review of the design + API surface + integration approach. Not a merge candidate to upstream
lance-format/lanceuntil RFC discussion lands and any agreed-upon scope/API changes are folded in.References
sezruby/lance-spark)